A. Jordan Nafa and Andrew Heiss
University of North Texas and Georgia State University
September 16th, 2022
In many fields of political science, baseline random assignment and experimental design are impossible
For many research questions, observational data are our only option, which makes causal inference difficult
The assumption of strict exogeneity is usually unrealistic, and valid instruments are a rare beast (Liu, Wang, and Xu 2022; Mogstad and Torgovitsky 2018; Swamy, Tavlas, and Hall 2015)
Need to rely on weaker assumptions for causal identification (Acharya, Blackwell, and Sen 2016; Blackwell and Glynn 2018; Forastiere, Mattei, and Ding 2018)
How can we estimate causal effects and describe their uncertainty using observational data?
Cross-sectional time series data and causal inference
Causal inference literature in political science focuses largely on frequentist econometrics (e.g., Imai and Kim 2019, 2020)
Recent works drawing on approaches developed in biostatistics outline a framework for estimating causal effects under the relatively weaker assumption of sequential ignorability (Acharya, Blackwell, and Sen 2016; Blackwell and Glynn 2018)
Our goal in this paper is to extend the approach to causal inference under selection on observables introduced by Blackwell and Glynn (2018) to a Bayesian framework
Marginal structural models (MSMs) are a multi-stage approach to estimating causal effects where baseline random assignment is not possible (Robins 1997; Robins, Hernán, and Brumback 2000)
Relies on inverse probability of treatment weighting to achieve covariate balance by constructing pseudo-populations (Cole and Hernán 2008; Imai and Ratkovic 2015; Saarela et al. 2015)
Adjusting for biasing paths in the propensity model allows for identification of causal effects that are impossible to estimate in a single model due to post-treatment bias
Possible to estimate lagged effects and “treatment histories” in cross-sectional time series data under complex temporal dependence (Blackwell and Glynn 2018)
Frequentist uncertainty estimates are based on assumptions about sampling distributions
Yet, in many areas of political science our data comprise an apparent population rather than a random sample from a larger population of interest (Berk, Western, and Weiss 1995; Gill 2001)
It doesn’t make sense to think in terms of random samples from a population if your observed data are the population (Gill and Heuberger 2020; Western and Jackman 1994)
A Bayesian framework provides a straightforward approach to accounting for and propagating uncertainty in the specification of the propensity model
Bayesian Model Averaging (BMA) and cross-validation-based stacking approaches allow us to avoid choosing a single specification for the propensity model (Kaplan and Chen 2014; Zigler and Dominici 2014)
Acknowledges that we are virtually always uncertain about what the true set of confounders is
May help reduce the degree to which our results depend on the propensity model being correctly specified (Hahn, Murray, and Carvalho 2020)
Inverse probability of treatment weights are an estimated quantity with associated uncertainty (Liao and Zigler 2020)
Most applications of IPTW methods in political science assume the weights are deterministic (e.g., Blackwell and Glynn 2018; Kurtz and Lauretig 2021; Ladam, Harden, and Windett 2018)
Need to propagate uncertainty in the design stage weights to outcome stage (Liao and Zigler 2020)
Highlights a problem with fully Bayesian estimation of MSMs, which requires the models be estimated jointly (Robins, Hernán, and Wasserman 2015; Zigler et al. 2013)
\[ \definecolor{treat}{RGB}{27,208,213} \definecolor{outcome}{RGB}{98,252,107} \definecolor{baseconf}{RGB}{244,199,58} \definecolor{covariates}{RGB}{178,26,1} \definecolor{index}{RGB}{37,236,167} \definecolor{timeid}{RGB}{244,101,22} \definecolor{mu}{RGB}{71,119,239} \definecolor{sigma}{RGB}{219,58,7} \newcommand{normalcolor}{\color{white}} \newcommand{treat}[1]{\color{treat} #1 \normalcolor} \newcommand{resp}[1]{\color{outcome} #1 \normalcolor} \newcommand{conf}[1]{\color{baseconf} #1 \normalcolor} \newcommand{covar}[1]{\color{covariates} #1 \normalcolor} \newcommand{obs}[1]{\color{index} #1 \normalcolor} \newcommand{tim}[1]{\color{timeid} #1 \normalcolor} \newcommand{mean}[1]{\color{mu} #1 \normalcolor} \newcommand{vari}[1]{\color{sigma} #1 \normalcolor} \]
For some binary treatment \(\treat{X}_{\obs{i}\tim{t}}\), the posterior expectation of the stabilized inverse probability of treatment weights for each unit \(\obs{i}\) at time \(\tim{t}\) is
\[ \text{IPW}_{\obs{i}\tim{t}} = \prod^{\tim{t}}_{\tim{s} = \tim{1}} \frac{\int\Pr[\treat{X}_{\obs{i}\tim{s}}~ | ~\treat{X}_{\obs{i}\tim{s-1}},~ \conf{C}_{\obs{i}}]\pi(\theta)d\theta}{\int\Pr[\treat{X}_{\obs{i}\tim{s}}~ |~\covar{Z}_{\obs{i}\tim{s}}, ~ \treat{X}_{\obs{i}\tim{s-1}},~ \resp{Y}_{\obs{i}\tim{s-1}},~ \conf{C}_{\obs{i}}]\pi(\theta)d\theta} \]
\(\treat{X}_{\obs{i}\tim{t-1}}\) and \(\resp{Y}_{\obs{i}\tim{t-1}}\) denote the treatment status and outcome for the \(\obs{i^{th}}\) unit in the previous period respectively
\(\conf{C}_{\obs{i}}\) is a set of time-invariant baseline covariates
\(\covar{Z}_{\obs{i}\tim{t}}\) is a set of time-varying covariates that satisfies sequential ignorability
Although we focus mainly on the average treatment effect at times \(\tim{t}\) and \(\tim{t-1}\), it is possible to estimate longer lags and other estimands as well.
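As a rough sketch, the posterior expectation of the stabilized weights above can be computed from fitted treatment probabilities with a cumulative product over periods. The function name, array layout, and toy inputs below are illustrative choices of ours, not part of the paper:

```python
import numpy as np

def stabilized_iptw(x, p_num, p_den):
    """Stabilized IPT weights for a binary treatment.

    x     : (n, T) observed 0/1 treatment matrix
    p_num : (n, T) posterior-mean Pr(X_t = 1 | X_{t-1}, C)  (numerator model)
    p_den : (n, T) posterior-mean Pr(X_t = 1 | Z_t, X_{t-1}, Y_{t-1}, C)
    """
    # Bernoulli probability of the treatment actually received
    num = np.where(x == 1, p_num, 1.0 - p_num)
    den = np.where(x == 1, p_den, 1.0 - p_den)
    # running product over periods 1..t for each unit
    return np.cumprod(num / den, axis=1)

# sanity check: identical numerator and denominator models yield weights of 1
x = np.array([[1, 0, 1]])
p = np.array([[0.6, 0.3, 0.7]])
w = stabilized_iptw(x, p, p)
```

When the two models coincide, the ratio is 1 in every period, so the cumulative product stays at 1 — a useful check that the weighting code is wired up correctly.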
It is also possible to extend IPTW to cases in which \(\treat{X}_{\obs{i}\tim{t}}\) is continuous, in which case the stabilized weights are
\[\text{IPW}_{\obs{i}\tim{t}} = \prod^{\tim{t}}_{\tim{s} = \tim{1}} \frac{f_{\treat{X}_{\obs{i}\tim{s}} | \treat{X}_{\obs{i}\tim{s-1}},\conf{C}_{\obs{i}}}[(\treat{X}_{\obs{i}\tim{s}}~ | ~\treat{X}_{\obs{i}\tim{s-1}},~ \conf{C}_{\obs{i}}); ~\mean{\mu}, ~\vari{\sigma^{2}}]}{f_{\treat{X}_{\obs{i}\tim{s}} |\covar{Z}_{\obs{i}\tim{s}}, \treat{X}_{\obs{i}\tim{s-1}}, \resp{Y}_{\obs{i}\tim{s-1}}, \conf{C}_{\obs{i}}}[(\treat{X}_{\obs{i}\tim{s}}~ |~\covar{Z}_{\obs{i}\tim{s}}, ~ \treat{X}_{\obs{i}\tim{s-1}},~ \resp{Y}_{\obs{i}\tim{s-1}},~ \conf{C}_{\obs{i}}); ~\mean{\mu}, ~\vari{\sigma^{2}}]} \]
Each of the terms \(\treat{X}\), \(\resp{Y}\), \(\covar{Z}\), and \(\conf{C}\) in the numerator and denominator is the same as in the binary version
The \(f_{\dots}(\cdot)\) expressions represent a probability density function with mean \(\mean{\mu}\) and variance \(\vari{\sigma^{2}}\)
We’ll focus mainly on binary treatment regimes, though this particular method tends to behave better for a continuous \(\treat{X}\) in some cases
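For the continuous case, the weights are ratios of conditional densities evaluated at the observed treatment. A minimal sketch, assuming Gaussian treatment models for the \(f(\cdot)\) terms (the function name and inputs are ours):

```python
import numpy as np
from scipy.stats import norm

def stabilized_iptw_continuous(x, mu_num, sd_num, mu_den, sd_den):
    """Density-ratio stabilized weights for a continuous treatment.

    All arguments are (n, T) arrays; mu/sd are the fitted conditional
    means and standard deviations of the numerator and denominator models.
    """
    num = norm.pdf(x, loc=mu_num, scale=sd_num)
    den = norm.pdf(x, loc=mu_den, scale=sd_den)
    return np.cumprod(num / den, axis=1)

# sanity check: identical models again yield weights of exactly 1
x = np.array([[0.2, -1.0, 0.5]])
mu = np.zeros_like(x)
sd = np.ones_like(x)
w = stabilized_iptw_continuous(x, mu, sd, mu, sd)
```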
To propagate uncertainty in the distribution of weights from the design stage while avoiding the problem of feedback inherent in joint estimation, we develop a Bayesian Pseudo-Likelihood estimator (Savitsky and Toth 2016; Williams and Savitsky 2020a, 2020b)
\[\begin{align} \hat{\pi}( \theta~|~y, \tilde{w}) ~\propto~ \left [\prod_{i = 1}^{n} \Pr(y_{i} ~|~ \theta)^{\tilde{w}_{i}}\right ]\pi(\theta) \end{align}\]
\(\tilde{w}_{i}\) is the realized IPT weight for the \(i^{th}\) observation
\(\prod_{i = 1}^{n} \Pr(y_{i} ~|~ \theta)^{\tilde{w}_{i}}\) is the pseudo-likelihood and \(\pi(\theta)\) denotes the prior probability for a parameter \(\theta\)
\(\hat{\pi}( \theta~|~y, \tilde{w})\) represents the Bayesian pseudo-posterior for \(\theta\)
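On the log scale, the pseudo-posterior is just the weighted sum of per-observation log-densities plus the log-prior. A toy sketch for a Bernoulli outcome with a single intercept parameter (the Normal(0, 1) prior on \(\theta\) here is an illustrative choice, not the paper's):

```python
import numpy as np
from scipy.stats import norm

def pseudo_log_posterior(theta, y, w):
    """log pi-hat(theta | y, w) for a Bernoulli outcome model:
    each unit's log-density enters multiplied by its IPT weight,
    then a Normal(0, 1) log-prior on theta is added."""
    p = 1.0 / (1.0 + np.exp(-theta))              # logit^{-1}(theta)
    loglik = y * np.log(p) + (1.0 - y) * np.log1p(-p)
    return np.sum(w * loglik) + norm.logpdf(theta)

y = np.array([1.0, 0.0, 1.0, 1.0])
unweighted = pseudo_log_posterior(0.0, y, np.ones_like(y))   # ordinary posterior
weighted = pseudo_log_posterior(0.0, y, np.array([2.0, 0.5, 1.0, 1.0]))
```

With all weights equal to 1 this reduces to the usual log-posterior, which is the natural baseline check.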
We decompose the matrix of weights from the design stage into a location component \(\lambda\) and a scale component \(\delta\)
The weight for each observation is sampled as \[\tilde{w}_{\obs{i}\tim{t}} = \lambda_{\obs{i}\tim{t}} + \delta_{\obs{i}\tim{t}} \cdot \epsilon_{\obs{i}\tim{t}}, \qquad \epsilon_{\obs{i}\tim{t}} \sim \pi(\epsilon)\] where \(\pi(\epsilon)\) is a regularizing prior on the scale of the weights, such as an Exponential distribution with rate greater than 3.5 or a Beta distribution with shape parameters \(\alpha = 2\) and \(\beta \ge 2\)
Provides computational stability and dampens extreme values when the IPT weights have high variance
Straightforward extensions for nested data structures via double-weighted estimation (Savitsky and Williams 2021)
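The location–scale decomposition can be sketched in a few lines. Here the regularizing prior is taken to be Exponential with rate 4.0 (one admissible value above the 3.5 threshold); the function name and the toy inputs are ours:

```python
import numpy as np

def sample_weights(lam, delta, rate=4.0, seed=0):
    """Draw w = lambda + delta * eps with eps ~ Exponential(rate).

    lam   : (n, T) location component of the design-stage weights
    delta : (n, T) scale component
    rate  : Exponential rate; values above 3.5 keep eps small,
            damping extreme draws when the weights have high variance
    """
    rng = np.random.default_rng(seed)
    # NumPy parameterizes the Exponential by scale = 1 / rate
    eps = rng.exponential(scale=1.0 / rate, size=lam.shape)
    return lam + delta * eps

lam = np.ones((3, 2))
delta = np.full((3, 2), 0.1)
w = sample_weights(lam, delta)
```

Because the Exponential draw is nonnegative, each sampled weight sits at or above its location component, with the perturbation shrinking as the rate grows.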
To assess parameter recovery and bias, we adapt the original simulation design from Blackwell and Glynn (2018)
We simulate 2000 data sets of varying dimensions, manipulating the path \(\treat{X}_{\obs{i}\tim{t-1}} \longrightarrow \covar{Z}_{\obs{i}\tim{t}}\)
Periods \(\in \{20, 50\}\)
Groups \(\in \{25, 45, 65, 85, 100\}\)
Objectives
Identify both \(\treat{X}_{\obs{i}\tim{t}} \longrightarrow \resp{Y}_{\obs{i}\tim{t}}\) and \(\treat{X}_{\obs{i}\tim{t-1}} \longrightarrow \resp{Y}_{\obs{i}\tim{t}}\)
Compare our Bayesian Pseudo-Likelihood approach against the more common auto-regressive distributed lag (ARDL) specification
As illustrated in the equation for the stabilized weights, we specify two separate models for the numerator and denominator with weakly informative independent normal priors on \(\alpha\) and \(\beta\)
\[\begin{align} \treat{X}_{\obs{i}\tim{t}} ~|~ \theta_{\obs{i}\tim{t}} &\sim \textit{Bernoulli}(\theta_{\obs{i}\tim{t}})\\ \theta_{\obs{i}\tim{t}} &= \text{logit}^{-1}(\alpha + X_{n}\beta_{k})\\ &\text{with priors}\\ \alpha &\sim \textit{Normal}(0, ~2) \quad \quad \beta_{k} \sim \textit{Normal}(0,~ 1)\\ \end{align}\]
For the numerator model, the matrix \(X_{n}\) is simply \(\treat{X}_{\obs{i}\tim{t-1}}\)
For the denominator model, \(X_{n} = \{\covar{Z}_{\obs{i}\tim{t}}, ~ \treat{X}_{\obs{i}\tim{t-1}},~ \resp{Y}_{\obs{i}\tim{t-1}}\}\)
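To see what these weakly informative priors imply before fitting, one can draw from the prior predictive distribution of the propensity model. A sketch (the function name and design matrix are ours; the Normal(0, 2) and Normal(0, 1) scales match the specification above):

```python
import numpy as np

def prior_predictive_propensity(X, n_draws=1000, seed=0):
    """Prior-predictive treatment probabilities for the propensity model:
    alpha ~ Normal(0, 2), beta_k ~ Normal(0, 1),
    theta = logit^{-1}(alpha + X @ beta)."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    alpha = rng.normal(0.0, 2.0, size=n_draws)
    beta = rng.normal(0.0, 1.0, size=(n_draws, k))
    eta = alpha[:, None] + beta @ X.T          # (n_draws, n) linear predictors
    return 1.0 / (1.0 + np.exp(-eta))          # map to probabilities

X = np.random.default_rng(1).normal(size=(10, 3))   # toy covariate matrix
theta = prior_predictive_propensity(X, n_draws=500)
```

On standardized covariates, these scales keep most prior mass away from the degenerate 0/1 extremes while still allowing strong effects.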
Overall, our proposed procedure performs well in terms of parameter recovery under fairly general conditions
Going forward, we need to apply this to some real world political science examples
Planned R package implementing our procedure, building on the brms package as a back end